Bad Hash Clusters
Hash clusters are a great way to reduce I/O on some tables, but they have their downsides.
- If too little space is reserved for each key, or if the cluster is created with too few hash keys, then each key will split across multiple blocks, negating the benefits of the cluster.
- If too much space is reserved for each key, or if the cluster is created with too many hash keys, then the cluster will contain thousands of empty blocks that slow down full table scans (see Empty Hash Cluster).
Oracle provides some reasonable advice on selecting the size per cluster key and the number of hash keys. See the CREATE CLUSTER statement in Oracle's online SQL Reference manual.
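For example (the names and numbers here are purely illustrative, not a recommendation), the two settings appear in CREATE CLUSTER like this:
CREATE CLUSTER emp_dept_cluster (deptno NUMBER(4))
SIZE 1024       -- bytes reserved per cluster key
HASHKEYS 1000;  -- number of hash keys (Oracle rounds this up to the nearest prime)

CREATE TABLE emp (
    empno  NUMBER(6),
    ename  VARCHAR2(30),
    deptno NUMBER(4)
) CLUSTER emp_dept_cluster (deptno);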
How can you tell if your cluster is inefficient?
You need to know two things: the distribution of your data, and the config settings of your cluster.
Run the following SQLs:
-- 1. Cluster configuration
SELECT cluster_name, cluster_type, key_size, hashkeys, avg_blocks_per_key
FROM user_clusters;

-- 2. Distribution of rows per cluster key
SELECT count(*) as keycount, avg(c) as mean, stddev(c) as stddev
FROM (
    SELECT count(*) as c
    FROM {clustered table name}
    GROUP BY {cluster col1} [, {cluster col2} ...]
);

-- 3. Average row length (avg_row_len is only populated once the table has been analyzed)
SELECT table_name, avg_row_len
FROM user_tables;
Repeat the 2nd SQL for each table in the cluster.
For each table in the cluster, multiply its avg_row_len (3rd SQL above) by its mean (2nd SQL) to determine the average number of bytes that table contributes per cluster key. Add these results together if there are multiple tables in the cluster - call this number avg_key_size. For example, if one table averages 100 bytes per row and 8 rows per key, and a second averages 50 bytes per row and 2 rows per key, then avg_key_size = 100 x 8 + 50 x 2 = 900 bytes.
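If you prefer to let the database do the arithmetic, the 2nd and 3rd SQLs can be combined. A sketch for a single-table cluster, using the same placeholders as above (for a multi-table cluster, run it once per table and add the results):
SELECT t.avg_row_len * d.mean as avg_key_size
FROM user_tables t,
     (SELECT avg(c) as mean
      FROM (
          SELECT count(*) as c
          FROM {clustered table name}
          GROUP BY {cluster col1} [, {cluster col2} ...]
      )) d
WHERE t.table_name = upper('{clustered table name}');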
In the first SQL, Oracle reserves key_size bytes for each of the hashkeys hash keys; i.e. the minimum size of the cluster is key_size x hashkeys bytes. For example, SIZE 512 with HASHKEYS 10000 pre-allocates at least 512 x 10000 = 5,120,000 bytes, even if the cluster is empty.
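You can compare that computed minimum with the space actually allocated by querying user_segments (the segment name is normally the cluster name in upper case):
SELECT segment_name, bytes, blocks
FROM user_segments
WHERE segment_type = 'CLUSTER'
AND segment_name = upper('{cluster name}');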
In the second SQL(s), keycount is the number of different cluster keys, mean is the average number of rows per cluster key, and stddev is the standard deviation of the number of rows per cluster key.
- Is hashkeys bigger than keycount (use the highest value of keycount if there are multiple tables in the cluster)? Much bigger; say 25% bigger?
If so, you have a lot of empty blocks waiting to be filled. If you are expecting lots of new key values in the future, this is OK, but in the meantime your full table scans will be slow.
- Is hashkeys smaller than keycount (use the highest value of keycount if there are multiple tables in the cluster)?
If so, then you have multiple cluster keys hashing to the same hash key and sharing the same space. This is only a problem if avg_blocks_per_key is greater than 1, because then you are reading 2 or more blocks when you only need to read one. If that is the case, you should either make key_size bigger (but no bigger than the size of 1 block - ask your DBA what block size you are using, or run the block size query after this list), or make hashkeys bigger (see the rebuild sketch after this list).
- If hashkeys is similar to keycount, then is avg_blocks_per_key greater than 1?
This means rows for one cluster key are spilling across multiple blocks. If key_size is greater than 1 block, this is OK because you expect more rows per cluster key than can physically fit into one block. If key_size is smaller than one block, then make it bigger (again, see the rebuild sketch after this list).
- Is the avg_key_size less than 75% of key_size?
If so, this may be OK if stddev is relatively large (say more than 25% of mean); otherwise it's a waste of space and will slow down full table scans. A large stddev means that the number of rows for any given cluster key varies significantly from the mean, so more space needs to be allocated to avoid splitting cluster keys across blocks. However, if stddev is relatively small, then key_size should be set closer to avg_key_size to save space.
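The block size mentioned above can be checked directly if you have access to v$parameter (otherwise ask your DBA):
SELECT value
FROM v$parameter
WHERE name = 'db_block_size';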
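Note that hashkeys is fixed when the cluster is created - it cannot be changed with ALTER CLUSTER - so increasing it means rebuilding the cluster and its tables. A sketch, with hypothetical names and sizes (use your real column list and datatypes):
CREATE CLUSTER my_cluster_2 ({cluster col1} NUMBER)  -- use the real column datatype
SIZE 2048        -- new, larger space per cluster key
HASHKEYS 20000;  -- new, larger number of hash keys

CREATE TABLE my_table_2
CLUSTER my_cluster_2 ({cluster col1})
AS SELECT * FROM {clustered table name};
After checking the copy, drop the old table and cluster, then rename my_table_2 to the original name.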